The data are available on Kaggle: https://www.kaggle.com/code/maryamsayagh1/cardiovascular-disease-prediction/data
library(knitr)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 9, fig.height = 9)
library(devtools)
## Loading required package: usethis
library(rmarkdown)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(FactoMineR)
library(InformationValue)
library(ISLR)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:InformationValue':
##
## confusionMatrix, precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
# Read the dataset
df <- read_delim("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/cardio_train.csv", delim = ";")
## Rows: 70000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (13): id, age, gender, height, weight, ap_hi, ap_lo, cholesterol, gluc, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# df <- read_delim("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/cardio_train.csv", delim = ";")
df <- as_tibble(df)
df <- df %>% rename_with(toupper)
df
## # A tibble: 70,000 × 13
## ID AGE GENDER HEIGHT WEIGHT AP_HI AP_LO CHOLE…¹ GLUC SMOKE ALCO ACTIVE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 18393 2 168 62 110 80 1 1 0 0 1
## 2 1 20228 1 156 85 140 90 3 1 0 0 1
## 3 2 18857 1 165 64 130 70 3 1 0 0 0
## 4 3 17623 2 169 82 150 100 1 1 0 0 1
## 5 4 17474 1 156 56 100 60 1 1 0 0 0
## 6 8 21914 1 151 67 120 80 2 2 0 0 0
## 7 9 22113 1 157 93 130 80 3 1 0 0 1
## 8 12 22584 2 178 95 130 90 3 3 0 0 1
## 9 13 17668 1 158 71 110 70 1 1 0 0 1
## 10 14 19834 1 164 68 110 60 1 1 0 0 0
## # … with 69,990 more rows, 1 more variable: CARDIO <dbl>, and abbreviated
## # variable name ¹CHOLESTEROL
dim(df)
## [1] 70000 13
This work is based on a set of R functions built specifically to fit and assess a logistic regression model. The response variable is CARDIO, which is binary. The predictors are quantitative and categorical.
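As a reminder of the model form, a binomial GLM with a logit link writes the log-odds of the response as a linear function of the predictors. A minimal self-contained sketch on simulated data (toy values only; the real fit below uses the cleaned Kaggle predictors):

```r
# Toy illustration of the model fitted below: logit(P(Y = 1)) = b0 + b1 * x
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))  # true b0 = -0.5, b1 = 1.2
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)  # estimates should land near (-0.5, 1.2)
```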
# Source the R add-ons that I coded for logistic regression in R
# devtools::source_url("https://github.com/stephaneLabs/Logistic_Addons/logisticRegressionAddons_20230209.R")
source("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/logisticRegressionAddons_20230314.R")
##
## Attaching package: 'DescTools'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## ResourceSelection 0.3-5 2019-07-22
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:arules':
##
## recode
## The following object is masked from 'package:DescTools':
##
## Recode
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#source("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/logisticRegressionAddons_20230314.R")
# Quick format of the data
df <- df %>% mutate(ID = ID %>% as.numeric(),
AGE = round(as.numeric(AGE) / 365, 1),
HEIGHT = HEIGHT %>% as.numeric(),
WEIGHT = WEIGHT %>% as.numeric(),
BMI = WEIGHT / (HEIGHT / 100)^2,
AP_HI = as.numeric(AP_HI) / 10,
AP_LO = as.numeric(AP_LO) / 10,
GENDER = GENDER %>% as.factor(),
CHOLESTEROL = CHOLESTEROL %>% as.factor(),
GLUC = GLUC %>% as.factor(),
SMOKE = SMOKE %>% as.factor(),
ALCO = ALCO %>% as.factor(),
ACTIVE = ACTIVE %>% as.factor(),
CARDIO = CARDIO %>% as.factor()
)
nrow_init <- nrow(df)
We compute here the BMI, the approximate age in years, and we divide systolic and diastolic blood pressure by 10 to bring them into the usual range seen when blood pressure is taken at the physician's.
Observations are considered outliers when they fall outside the interval ]Q1 - 1.5 × IQR; Q3 + 1.5 × IQR[, where Q1 is the first quartile, Q3 the third quartile and IQR the interquartile range.
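The actual removeOutliers() used below comes from the sourced add-on file; the following standalone sketch (hypothetical names) only mirrors the fence rule just described:

```r
# Illustrative Tukey-fence filter; NOT the add-on's removeOutliers(),
# just a sketch of the same rule.
remove_outliers_sketch <- function(df, variable) {
  x <- df[[variable]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  keep <- x > (q[[1]] - 1.5 * iqr) & x < (q[[2]] + 1.5 * iqr)
  list(without_df = df[keep, , drop = FALSE], n_removed = sum(!keep))
}

# Toy check: 1000 lies far outside the fences and is dropped
toy <- data.frame(AGE = c(40, 45, 50, 52, 55, 60, 1000))
nrow(remove_outliers_sketch(toy, "AGE")$without_df)  # 6
```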
quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 70000
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.60 48.40 54.00 53.34 58.40 65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df <- removeOutliers(df, variable = "AGE")$without_df
## Number of rows of the initial dataframe : 70000
## Number of rows of the cleaned dataframe : 69996
## Fraction of rows kept : 1
quantitativeVariableDescription(df = df, variableName = "HEIGHT")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.0 159.0 165.0 164.4 170.0 250.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "WEIGHT")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 65.00 72.00 74.21 82.00 200.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 69996
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.472 23.875 26.378 27.557 30.222 298.667
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
df <- removeOutliers(df, variable = "BMI")$without_df
## Number of rows of the initial dataframe : 69996
## Number of rows of the cleaned dataframe : 68001
## Fraction of rows kept : 0.97
df <- removeOutliers(df, variable = "AP_HI")$without_df
## Number of rows of the initial dataframe : 68001
## Number of rows of the cleaned dataframe : 65059
## Fraction of rows kept : 0.96
df <- removeOutliers(df, variable = "AP_LO")$without_df
## Number of rows of the initial dataframe : 65059
## Number of rows of the cleaned dataframe : 62011
## Fraction of rows kept : 0.95
# systolic blood pressure AP_HI must be superior to diastolic blood pressure AP_LO
df <- df %>% filter(AP_HI > AP_LO)
nrow_clean <- nrow(df)
pct_clean <- ((nrow_init - nrow_clean) / nrow_init) * 100
cat("The cleaning removed ")
## The cleaning removed
cat(pct_clean)
## 11.41571
cat("% of the rows of the initial dataframe\n")
## % of the rows of the initial dataframe
# Strange patterns suggest that 0 stands for NA for some variables
df_quanti <- select(df, c(ID, AGE, HEIGHT, WEIGHT, BMI, AP_HI, AP_LO, CARDIO)) %>% column_to_rownames(var = "ID")
# panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
# {
# par(usr = c(0, 1, 0, 1))
# r <- abs(cor(x, y))
# txt <- format(c(r, 0.123456789), digits = digits)[1]
# txt <- paste0(prefix, txt)
# if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
# text(0.5, 0.5, txt, cex = cex.cor * r)
# }
#
# pairs(df_quanti, upper.panel = panel.cor, col = c("black", "red")[as.numeric(df_quanti$CARDIO)])
quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.10 48.50 54.00 53.37 58.40 65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.48 23.88 26.22 27.02 29.75 39.74
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "AP_HI")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 12.00 12.00 12.62 14.00 16.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
quantitativeVariableDescription(df = df, variableName = "AP_LO")
## Length of the variable :
## 62009
## Variable summary :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.600 8.000 8.000 8.164 9.000 10.400
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# PCA of the data; missing values are replaced by the mean of the variable
# Some variables are highly correlated (collinearity problems?)
df_PCA <- PCA(X = df_quanti, scale.unit = TRUE, quali.sup = 7, ncp = 6, graph = FALSE)
# The first four components explain 95% of the variance
barplot(df_PCA$eig[,2])
kable(df_PCA$eig)
| | eigenvalue | percentage of variance | cumulative percentage of variance |
|---|---|---|---|
| comp 1 | 2.2698131 | 37.8302176 | 37.83022 |
| comp 2 | 1.4042520 | 23.4042000 | 61.23442 |
| comp 3 | 1.1395236 | 18.9920605 | 80.22648 |
| comp 4 | 0.8895977 | 14.8266285 | 95.05311 |
| comp 5 | 0.2930822 | 4.8847028 | 99.93781 |
| comp 6 | 0.0037314 | 0.0621907 | 100.00000 |
# Discrimination of the 2 diagnostic groups is not very good in the first plane of the PCA
plot(df_PCA,axes = c(1,2), choix = "var")
plot(df_PCA,axes = c(1,2), choix = "ind", habillage = 7)
plot(df_PCA,axes = c(1,3), choix = "var")
plot(df_PCA,axes = c(1,3), choix = "ind", habillage = 7)
plot(df_PCA,axes = c(2,3), choix = "var")
plot(df_PCA,axes = c(2,3), choix = "ind", habillage = 7)
We check here the relationship between the log-odds of cardiovascular disease and the predictors. A linear relationship has to be observed to use the predictor as quantitative. If the relationship is not linear, a transformation, or a conversion into a factor, has to be done.
To get the values for our plot, groups are built with a discretization function and the group means are plotted to see if we get a linear shape. The discretization method is the “cluster” option of the “discretize” function of the “arules” package, which is based on k-means clustering.
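The diagnostic can be sketched without the add-on: bin the predictor, then compare the per-bin empirical log-odds against the per-bin mean. In this sketch, plain quantile bins stand in for arules::discretize(method = "cluster") to keep it dependency-free:

```r
# Sketch of the log-odds-vs-predictor diagnostic on simulated data.
set.seed(42)
x <- rnorm(5000, mean = 50, sd = 5)
y <- rbinom(5000, 1, plogis(0.2 * (x - 50)))  # logit truly linear in x
grp <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE)             # 10 equal-frequency bins
mean_x  <- tapply(x, grp, mean)
logodds <- qlogis(tapply(y, grp, mean))       # empirical log-odds per bin
# plot(mean_x, logodds)  # roughly a straight line when linearity holds
```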
# description between the log(odds) and the quantitative predictors (should be linear)
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AGE")
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "HEIGHT")
# Problem
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "WEIGHT")
# Problem
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "BMI")
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_HI")
# Acceptable
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_LO")
# HEIGHT is discretized
df <- df %>% mutate(HEIGHT_grouped = as.factor(discretize(HEIGHT, method = "frequency", breaks = 3)))
attr(df$HEIGHT_grouped, "discretized:breaks") <- NULL
attr(df$HEIGHT_grouped, "discretized:method") <- NULL
HEIGHT_groupedFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "HEIGHT_grouped")
# Frequencies
kable(HEIGHT_groupedFreq$frequencies)
| | [120,161) | [161,168) | [168,207] |
|---|---|---|---|
| 0 | 9887 | 9808 | 11844 |
| 1 | 10084 | 9193 | 11193 |
# Proportions
kable(HEIGHT_groupedFreq$proportions)
| | [120,161) | [161,168) | [168,207] | Sum |
|---|---|---|---|---|
| 0 | 15.94 | 15.82 | 19.10 | 50.86 |
| 1 | 16.26 | 14.83 | 18.05 | 49.14 |
| Sum | 32.21 | 30.64 | 37.15 | 100.00 |
# Chi2 test of independence
print(HEIGHT_groupedFreq$independanceChi2)
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 21.823, df = 2, p-value = 1.825e-05
# Proportions conditioned by outcome
kable(HEIGHT_groupedFreq$ConditionnalProportions1)
| | [120,161) | [161,168) | [168,207] | Sum |
|---|---|---|---|---|
| 0 | 31.35 | 31.10 | 37.55 | 100 |
| 1 | 33.09 | 30.17 | 36.73 | 100 |
# Proportions conditioned by predictor
kable(HEIGHT_groupedFreq$ConditionnalProportions2)
| | [120,161) | [161,168) | [168,207] |
|---|---|---|---|
| 0 | 49.51 | 51.62 | 51.41 |
| 1 | 50.49 | 48.38 | 48.59 |
| Sum | 100.00 | 100.00 | 100.00 |
plotProportionsResponseConditionnedByPredictor(HEIGHT_groupedFreq$ConditionnalProportions2, "HEIGHT")
GENDERFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "GENDER")
GENDERFreq
## $frequencies
## categoPred
## binaryResp 1 2
## 0 20259 11280
## 1 19546 10924
##
## $proportions
## categoPred
## binaryResp 1 2 Sum
## 0 32.67 18.19 50.86
## 1 31.52 17.62 49.14
## Sum 64.19 35.81 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 0.046658, df = 1, p-value = 0.829
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 Sum
## 0 64.23 35.77 100.00
## 1 64.15 35.85 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2
## 0 50.9 50.8
## 1 49.1 49.2
## Sum 100.0 100.0
plotProportionsResponseConditionnedByPredictor(GENDERFreq$ConditionnalProportions2, "GENDER")
CHOLESTEROLFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "CHOLESTEROL")
CHOLESTEROLFreq
## $frequencies
## categoPred
## binaryResp 1 2 3
## 0 26560 3306 1673
## 1 20385 4802 5283
##
## $proportions
## categoPred
## binaryResp 1 2 3 Sum
## 0 42.83 5.33 2.70 50.86
## 1 32.87 7.74 8.52 49.14
## Sum 75.71 13.08 11.22 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 2944.2, df = 2, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 3 Sum
## 0 84.21 10.48 5.30 100.00
## 1 66.90 15.76 17.34 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2 3
## 0 56.58 40.77 24.05
## 1 43.42 59.23 75.95
## Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(CHOLESTEROLFreq$ConditionnalProportions2, "CHOLESTEROL")
GLUCFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "GLUC")
GLUCFreq
## $frequencies
## categoPred
## binaryResp 1 2 3
## 0 27900 1834 1805
## 1 25104 2510 2856
##
## $proportions
## categoPred
## binaryResp 1 2 3 Sum
## 0 44.99 2.96 2.91 50.86
## 1 40.48 4.05 4.61 49.14
## Sum 85.48 7.01 7.52 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test
##
## data: responseVsPredictorFrequencies
## X-squared = 471.39, df = 2, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 1 2 3 Sum
## 0 88.46 5.82 5.72 100.00
## 1 82.39 8.24 9.37 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 1 2 3
## 0 52.64 42.22 38.73
## 1 47.36 57.78 61.27
## Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(GLUCFreq$ConditionnalProportions2, "GLUC")
SMOKEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "SMOKE")
SMOKEFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 28575 2964
## 1 27935 2535
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 46.08 4.78 50.86
## 1 45.05 4.09 49.14
## Sum 91.13 8.87 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 22.161, df = 1, p-value = 2.507e-06
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 90.60 9.40 100.00
## 1 91.68 8.32 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 50.57 53.90
## 1 49.43 46.10
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(SMOKEFreq$ConditionnalProportions2, "SMOKE")
ALCOFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "ALCO")
ALCOFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 29786 1753
## 1 28942 1528
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 48.03 2.83 50.86
## 1 46.67 2.46 49.14
## Sum 94.71 5.29 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 9.0248, df = 1, p-value = 0.002663
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 94.44 5.56 100.00
## 1 94.99 5.01 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 50.72 53.43
## 1 49.28 46.57
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ALCOFreq$ConditionnalProportions2, "ALCO")
ACTIVEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
categoricalPredictor = "ACTIVE")
ACTIVEFreq
## $frequencies
## categoPred
## binaryResp 0 1
## 0 5727 25812
## 1 6405 24065
##
## $proportions
## categoPred
## binaryResp 0 1 Sum
## 0 9.24 41.63 50.86
## 1 10.33 38.81 49.14
## Sum 19.56 80.44 100.00
##
## $independanceChi2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: responseVsPredictorFrequencies
## X-squared = 80.494, df = 1, p-value < 2.2e-16
##
##
## $ConditionnalProportions1
## categoPred
## binaryResp 0 1 Sum
## 0 18.16 81.84 100.00
## 1 21.02 78.98 100.00
##
## $ConditionnalProportions2
## categoPred
## binaryResp 0 1
## 0 47.21 51.75
## 1 52.79 48.25
## Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ACTIVEFreq$ConditionnalProportions2, "ACTIVE")
# Splitting the data into a training and a testing set
# Rename the levels of the factors so that the train function works
make_factor <- function(x){return(factor(x, labels = make.names(levels(x))))}
# format data set
df <- df %>% mutate(CARDIO = make_factor(CARDIO))
df <- df %>% mutate(GENDER = make_factor(GENDER))
df <- df %>% mutate(CHOLESTEROL = make_factor(CHOLESTEROL))
df <- df %>% mutate(GLUC = make_factor(GLUC))
df <- df %>% mutate(SMOKE = make_factor(SMOKE))
df <- df %>% mutate(ALCO = make_factor(ALCO))
df <- df %>% mutate(ACTIVE = make_factor(ACTIVE))
# make this example reproducible
set.seed(1)
# Use 70% of the dataset as the training set and the remaining 30% as the testing set
sample_train <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7,0.3))
df_train <- df[sample_train, ]
df_test <- df[!sample_train, ]
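As a quick check of the renaming above, make.names() turns the numeric levels "0"/"1" into the syntactically valid "X0"/"X1" that caret::train() requires for class-probability output:

```r
# make_factor as defined above, checked on a toy factor
make_factor <- function(x) factor(x, labels = make.names(levels(x)))
f <- factor(c(0, 1, 1, 0))
levels(make_factor(f))  # "X0" "X1"
```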
# Null model
m0 <- glm(CARDIO ~ 1, family=binomial(link = logit),
data = df_train)
AIC(m0)
## [1] 60033.83
# Cross validation with 10 folds, asks for output with ROC, sensitivity and specificity (summaryFunction = twoClassSummary)
ctrl <- trainControl(method = "cv", number = 10, summaryFunction = twoClassSummary, classProbs = TRUE, savePredictions = "all")
# Fit a logistic regression model and use k-fold CV to evaluate performance
model <- train(CARDIO ~ AGE + GENDER + HEIGHT_grouped + BMI + AP_HI + AP_LO + CHOLESTEROL + GLUC + SMOKE + ALCO + ACTIVE,
data = df_train, method = "glm", family = "binomial", trControl = ctrl)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
# View the summary of the k-fold CV
print(model)
## Generalized Linear Model
##
## 43314 samples
## 11 predictor
## 2 classes: 'X0', 'X1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 38982, 38982, 38983, 38983, 38982, 38982, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7857292 0.7924085 0.6504285
print(summary(model))
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9229 -0.9172 -0.4218 0.9301 2.4096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.793979 0.175028 -73.097 < 2e-16 ***
## AGE 0.050335 0.001706 29.497 < 2e-16 ***
## GENDERX2 -0.058036 0.027413 -2.117 0.034250 *
## `HEIGHT_grouped[161,168)` 0.068356 0.028436 2.404 0.016224 *
## `HEIGHT_grouped[168,207]` 0.126915 0.030612 4.146 3.38e-05 ***
## BMI 0.031999 0.002668 11.992 < 2e-16 ***
## AP_HI 0.655234 0.012947 50.610 < 2e-16 ***
## AP_LO 0.123807 0.021431 5.777 7.60e-09 ***
## CHOLESTEROLX2 0.375804 0.034673 10.839 < 2e-16 ***
## CHOLESTEROLX3 1.120715 0.045610 24.572 < 2e-16 ***
## GLUCX2 0.011151 0.046153 0.242 0.809086
## GLUCX3 -0.384565 0.049960 -7.697 1.39e-14 ***
## SMOKEX1 -0.144589 0.043877 -3.295 0.000983 ***
## ALCOX1 -0.303036 0.053872 -5.625 1.85e-08 ***
## ACTIVEX1 -0.235352 0.027555 -8.541 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60032 on 43313 degrees of freedom
## Residual deviance: 48700 on 43299 degrees of freedom
## AIC: 48730
##
## Number of Fisher Scoring iterations: 4
print(model$results)
## parameter ROC Sens Spec ROCSD SensSD SpecSD
## 1 none 0.7857292 0.7924085 0.6504285 0.008741415 0.009990765 0.009705827
kable(exp(coef(model$finalModel)))
| | x |
|---|---|
| (Intercept) | 0.0000028 |
| AGE | 1.0516232 |
| GENDERX2 | 0.9436157 |
| HEIGHT_grouped[161,168) | 1.0707462 |
| HEIGHT_grouped[168,207] | 1.1353205 |
| BMI | 1.0325169 |
| AP_HI | 1.9255924 |
| AP_LO | 1.1317975 |
| CHOLESTEROLX2 | 1.4561624 |
| CHOLESTEROLX3 | 3.0670461 |
| GLUCX2 | 1.0112132 |
| GLUCX3 | 0.6807467 |
| SMOKEX1 | 0.8653778 |
| ALCOX1 | 0.7385722 |
| ACTIVEX1 | 0.7902923 |
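The table above is exp(coef): each entry is an odds ratio. For example, the AP_HI coefficient of about 0.6552 (systolic pressure rescaled by 10) gives:

```r
# Odds ratio for a one-unit increase of AP_HI (i.e. +10 mmHg systolic)
exp(0.655234)  # about 1.926: each extra 10 mmHg nearly doubles the odds
```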
resultsROC <- rocCurve(trainedModel = model, df_train = df_train, df_test = df_test, outcomeVariable = "CARDIO")
## Training data set
## -----------------
## Dimensions of the training data set : 43314 15
## length of the outcome variable : 43314
## length of the predicted variable : 43314
## Train data set
## Optimization method : Youden metric
## Testing data set
## -----------------
## Dimensions of the testing data set : 18695 15
## length of the outcome variable : 18695
## length of the predicted variable : 18695
## Test data set
## Optimization method : Youden metric
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
kable(resultsROC$optim_train)
| | FPR | TPR | MCER | threshold | YoudenMetric | TLCMetric |
|---|---|---|---|---|---|---|
| 46 | 0.2868899 | 0.7347605 | 0.2758692 | 0.45 | 0.4478707 | 0.3907144 |
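The YoudenMetric column is Youden's J statistic, J = TPR - FPR, and the selected threshold (0.45 on the training set) is the one maximising J. The training row above can be checked directly:

```r
# Youden's J for the optimal training-set threshold (values from the table)
tpr <- 0.7347605
fpr <- 0.2868899
tpr - fpr  # matches the YoudenMetric column up to rounding
```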
kable(resultsROC$optim_test)
| | FPR | TPR | MCER | threshold | YoudenMetric | TLCMetric |
|---|---|---|---|---|---|---|
| 48 | 0.3083442 | 0.7543989 | 0.276491 | 0.47 | 0.4460547 | 0.394203 |